REMARKS-General 



Response to Objections of Claims 1-8 
Claims 1-8 have been amended to correct the misuse of periods.
Response to Rejection of Claim 1 under 35 USC 103 

1. With respect to claim 1, the Examiner asserts that Chang et al. (Chang, US 6,704,728)
disclose steps (a) and (b), and that Douglas Russell Judd et al. (Judd, US Patent Publication
2004/0039734) disclose steps (c-1) to (c-3).

2. Step (a) includes generating a query token sequence, having at least a
query token, from a query submitted by a user. Chang discloses that a
computer receives a query from a user. The query is normalized as pretext. The
normalized forms typically include word stems only, that is, the suffixes are
removed (col. 2, lines 56-67). Chang teaches how to normalize a query, but
no token is introduced.

3. Step (b) includes generating at least a representative token sequence,
having at least a document token, from each of said text documents that
contain at least one token of said query token sequence (col. 9, lines 49-60).
Chang teaches that the grammar tokens are searched from the grammar index.

4. Step (c-1) includes determining a token appearance score by measuring
a token appearance of said representative token sequence with respect to said
query token sequence. But the score taught by Judd in paragraphs 56, 58, and
68 is related to the quantity of the desired search terms.

5. Step (c-2) includes determining a token order score by measuring a token
order of said representative token sequence with respect to said query token
sequence. But the score taught by Judd in paragraphs 26 and 67 is related to
the order between the matching terms.

6. Step (c-3) includes determining a token consecutiveness score by
measuring a token consecutiveness of said representative token sequence
with respect to said query token sequence. But the score taught by Judd in
paragraphs 57 and 64 is related to the distance between the matching terms.

7. Step (d) is related to the ranking order in accordance with the token
appearance score, token order score, and token consecutiveness score. But
the above three kinds of scores are not disclosed by Judd.

8. To sum up, claim 1 is not disclosed by Chang in view of Judd. Therefore, the
application should be patentable, and its dependent claims 2-20 should be
patentable as well.



Respectfully submitted, 



WekJao, CHEN

Title



Sequence Based Indexing and Retrieval Method for Text Documents 

Background of the Present Invention 
Field of Invention 

The present invention relates to a database search engine and, more particularly,
to a sequence based indexing and retrieval method for a collection of text documents,
which is adapted to produce a ranked list of the text documents relative to a user's query
by matching representative token sequences of each document in the collection against
the token sequence of the query.

Description of Related Arts

The main task of a text retrieval system is to help the user find, from a
collection of text documents, those that are relevant to his query. The system usually
creates an index for the text collection to accelerate the search process. Inverted indices
(files) are a popular way for such indexing. For each token (word or character), the index
records the identifier of every document containing the token. Some extensions of
inverted indices record not only which documents contain a particular token, but also the
positions at which the token appears in each document.

Traditional text retrieval models (such as the boolean model and the vector
model) are only concerned with the existence of a token in the target document and are
insensitive to token order or position. Given a query "United Nations," a traditional
retrieval system would consider a document with both "United" and "Nation" (after
stemming) as equally relevant as a document that actually contains the phrase "United
Nations." One solution to this problem is to index phrases, which would considerably
increase the size of the index and require the use of a dictionary. An alternative is for a
retrieval system to utilize positional information. If the system takes positional
information into account, a document that contains "United" and "Nations" in
consecutive positions will be ranked higher than a document with both words in separate
positions. The present invention exploits positional information to its fullest potential.

Summary of the Present Invention 

A main object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, which treats the documents and
queries as sequences of token-position pairs and estimates the similarity between the
document and query, so as to enhance the retrieval effectiveness while performing the
query on the text documents.

Another object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, wherein the similarity
measurement includes the token appearance, the token order, and the token
consecutiveness, such that the approximate matching and fault-tolerant capability are
substantially enhanced so as to precisely determine the similarity between the document
and query.

Another object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, wherein the text document is
pre-processed to select the candidate document therefrom to match with the query token
sequence so as to enhance the speed of the retrieval process.

Another object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, wherein each of the text
documents is indexed to measure a differentiating position of each two adjacent
document tokens in the text document so as to enhance the process of matching the query
token sequence with the document token sequence.

Another object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, which is specifically designed as
a flexible and modular process that is easy to adjust, modify, and add modules or
functionalities for further development.

Another object of the present invention is to provide a sequence based indexing
and retrieval method for a collection of text documents, which is adapted to process the
text document in Chinese, English, numbers, punctuations, and symbols, so as to enhance
the practical use of the present invention.

Accordingly, in order to accomplish the above objects, the present invention
provides a sequence based indexing and retrieval method for a text document, comprising
the steps of:

(a) generating a query token sequence, having at least a query token, from a
query submitted by a user;

(b) generating at least a representative token sequence, having at least a
document token, from each of said text documents that contain at least one token of said
query token sequence;

(c) measuring a similarity between said query token sequence and each of
said representative token sequences; and

(d) retrieving said text documents in response to said similarity of said
representative token sequence with respect to said query token sequence with a ranking
order in accordance with a token appearance score, a token order score, and a token
consecutiveness score, provided that for a document with two representative token
sequences, its similarity is determined by the representative token sequence with a higher
score.

The similarity measurement is performed by determining a token appearance
score, a token order score, and a token consecutiveness score of the representative token
sequence with respect to the query token sequence. Therefore, the total score of the
token appearance, the token order, and the token consecutiveness is determined as a
similarity index to illustrate the similarity between the representative token sequence and
the query token sequence, so as to precisely and effectively retrieve the text document.

These and other objectives, features, and advantages of the present invention
will become apparent from the following detailed description, the accompanying
drawings, and the appended claims.

Brief Description of the Drawings

Fig. 1 is a flow chart illustrating a sequence based indexing and retrieval method for a 
collection of text documents according to a preferred embodiment of the present 
invention. 

Detailed Description of the Preferred Embodiment

Referring to Fig. 1 of the drawings, a sequence based indexing and retrieval 
method for a text document according to a preferred embodiment of the present invention 
is illustrated, wherein the method comprises the following steps. 

(1) Generate a query token sequence, having at least a query token, from a
query submitted by a user.

(2) Generate at least a representative token sequence, having at least a
document token, from each of said documents that contain at least one token of said
query token sequence.

(3) Measure a similarity between each of the representative token sequences
and the query token sequence.

(4) Retrieve the text documents in response to said similarity of said
representative token sequence with respect to said query token sequence with a ranking
order in accordance with a token appearance score, a token order score, and a token
consecutiveness score, provided that for a document with two representative token
sequences, its similarity is determined by the representative token sequence with a higher
score.

In step (1), the query may contain both English and Chinese. A "Tokenizer"
process is performed to transform the query text into the query token sequence. The key
of the Tokenizer is its data analysis component. The input data of the data analysis
component is text which is represented as a byte array. This component processes the
byte array elements one by one. When encountering the first byte of a Chinese character
(in BIG5 encoding, the first byte of a Chinese character ranges from 'A4' to 'F9'),
the component combines it with the next byte to construct a Chinese character. When encountering an
English letter ('41' to '5A' and '61' to '7A'), the present invention will check the next bytes
continuously until reaching a non-English and non-hyphen byte. Then, all checked
English letters are combined to construct an English word. If we encounter a non-English
and non-Chinese byte (for example, a number), it will be treated as an
independent unit.

After the data analysis component has parsed out a Chinese character, an
English word or others, we use the information to construct a new token by its content,
type, and position. After we have processed all bytes, a sequence of query tokens will be
constructed.
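
As an illustration only, the data analysis component described above might be sketched in Python as follows. The names Token and tokenize, the 1-based token position, the skipping of whitespace bytes, and the upper bound 0xF9 for the first byte of a BIG5 Chinese character are assumptions made for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    content: str   # the character, word, or other unit
    type: str      # "chinese", "english", or "other"
    position: int  # 1-based position in the token sequence (assumed convention)


def is_english_letter(b: int) -> bool:
    # ASCII 'A'-'Z' (0x41-0x5A) and 'a'-'z' (0x61-0x7A)
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A


def tokenize(data: bytes) -> list[Token]:
    """Scan a BIG5/ASCII byte array one element at a time and emit tokens."""
    tokens: list[Token] = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xA4 <= b <= 0xF9 and i + 1 < len(data):
            # First byte of a Chinese character: combine it with the next byte.
            content = data[i:i + 2].decode("big5", errors="replace")
            kind = "chinese"
            i += 2
        elif is_english_letter(b):
            # Keep checking bytes until a non-English, non-hyphen byte is reached.
            j = i + 1
            while j < len(data) and (is_english_letter(data[j]) or data[j] == 0x2D):
                j += 1
            content = data[i:j].decode("ascii")
            kind = "english"
            i = j
        else:
            # Any other byte (digit, punctuation, symbol) is an independent unit.
            content = data[i:i + 1].decode("latin-1")
            kind = "other"
            i += 1
            if content.isspace():
                continue  # assumption: whitespace produces no token
        tokens.append(Token(content, kind, len(tokens) + 1))
    return tokens
```

In this sketch each digit becomes its own token; an implementation could equally group consecutive digits into one unit.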

It is worth mentioning that verb forms vary under the rules of grammar of the
English language, such as present tense, past tense, etc., such that the step (1) further
comprises a step of stemming the query tokens to encode the text words into the
corresponding word stems respectively by a stemmer. For example, the query token
"connecting" is encoded to be "connect" as the original word stem by removing the suffix
thereof. However, for some languages, such as the Chinese language, the stemming step can
be omitted due to the rules of grammar of the language.

After the introduction of the Tokenizer component, we now explain our method.
First, we have to build an index for the collection of text documents. For each token, we
record not only which documents contain the token but also the positions at which the
token appears in each document. For example, the index of a token in essence can be
expressed as an extended inverted list:

((D_1, (P_1, P_2, P_3, ...)), (D_2, (P_1, P_2, P_3, ...)), ...)
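
A minimal sketch of such an index, assuming each document has already been tokenized into a list of strings and that positions are counted from 1, could look like this. The function name build_index and the nested-dictionary layout are illustrative choices, not taken from the disclosure.

```python
from collections import defaultdict


def build_index(docs: dict[str, list[str]]) -> dict[str, dict[str, list[int]]]:
    """Map each token to {document id: [positions where the token appears]}."""
    index: dict[str, dict[str, list[int]]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens, start=1):
            # Positions are appended in increasing order for each document.
            index[tok].setdefault(doc_id, []).append(pos)
    return dict(index)
```

Each entry index[token] then plays the role of one extended inverted list: a set of (document, positions) pairs.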

According to the preferred embodiment, the step (2) further comprises a step of
selecting at least a candidate document from the text documents, wherein one of the text
documents is selected to be a candidate document when the respective text document
contains at least one token in the query token sequence.

If the query token sequence contains common words, such as "we," the number
of possible candidate documents will be large and thus will reduce the efficiency of the
retrieval system. The solution is to adopt the "token weights" concept. The basic idea of
this approach is to eliminate tokens with low discrimination power in the query token
sequence. Before using this approach, we have to calculate token weights first. We use
the inverse document frequency (idf) metric as token weights. With the weight of each
token, we can decide a threshold to drop unimportant query tokens in candidate
document selection.

Here we introduce the approach we designed to solve this problem.

1. For a query token sequence, first we find the tokens with the highest
weight (W_h) and the lowest weight (W_l).

2. A cut-off percentage cp is given as an implementation parameter, wherein
cp is in the range between 0 and 1.

3. Check each query token in the query token sequence. If a token's weight
is lower than W_l + cp * (W_h - W_l), we determine that the query token is not as important
as other query tokens, and do not use it to select candidate documents.
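
The three steps above can be sketched as follows. The logarithmic form of idf used in idf_weights is a common convention assumed here for illustration; the disclosure names the idf metric but does not spell out a formula. The function names are likewise hypothetical.

```python
import math


def idf_weights(doc_freq: dict[str, int], n_docs: int) -> dict[str, float]:
    """Assumed weighting: log(N / df) + 1 for each token."""
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def select_query_tokens(query_tokens: list[str],
                        weights: dict[str, float],
                        cp: float) -> list[str]:
    """Keep the query tokens whose weight is not below W_l + cp * (W_h - W_l)."""
    ws = [weights[t] for t in query_tokens]
    w_h, w_l = max(ws), min(ws)          # step 1: highest and lowest weight
    threshold = w_l + cp * (w_h - w_l)   # step 2: cut-off from parameter cp
    # step 3: tokens strictly below the threshold are not used for selection
    return [t for t in query_tokens if weights[t] >= threshold]
```

With cp = 0 every query token is kept; with cp = 1 only the tokens carrying the highest weight remain.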

The document token sequence of the text document is obtained as follows: for each token
in a query token sequence, the extended inverted list thereof is obtained from the index;
and all lists are combined to construct the document token sequences.
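
One way to picture this combination step, assuming the index is held as a mapping from token to {document id: positions}, is sketched below. The function name and the data layout are assumptions for illustration.

```python
def document_token_sequence(index: dict[str, dict[str, list[int]]],
                            query_tokens: list[str],
                            doc_id: str) -> list[tuple[str, int]]:
    """Merge the position lists of all query tokens for one document,
    ordered by position, giving (token, position) pairs."""
    pairs = {
        (pos, tok)
        for tok in set(query_tokens)
        for pos in index.get(tok, {}).get(doc_id, [])
    }
    return [(tok, pos) for pos, tok in sorted(pairs)]
```

Tokens of the document that do not occur in the query never enter the sequence, so only the positions of matching tokens survive.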

After the document token sequence is chosen, we have to find its representative
token sequences. A representative token sequence is a segment of the document token
sequence. We divide a document token sequence into segments, wherein for each
segment, the distance between two adjacent document tokens is no longer than a
predetermined positioning value. The two longest segments of the document token sequence
are selected as representative token sequences. Here we give an example:

The query token sequence: A_1 B_2

The document: AXXBABXXXBAXXXBABABBXXXBA

The given threshold (predetermined positioning value): 3

After the division, we obtain the following four segments: A_1 B_4 A_5 B_6, B_10 A_11,
B_15 A_16 B_17 A_18 B_19 B_20, and B_24 A_25. The two longest segments, i.e., A_1 B_4 A_5 B_6 and
B_15 A_16 B_17 A_18 B_19 B_20, will be the representative token sequences of this document.



To summarize, the two longest segments of the document token sequence are
selected as representative token sequences, wherein the positional difference between each
pair of adjacent document tokens is no larger than a predetermined positioning value while said
corresponding text document is selected as the said candidate document.
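
The segmentation can be sketched as below, assuming the document token sequence is a list of (token, position) pairs in increasing position order. The function names and the tie-breaking rule (earlier segments win among equal lengths) are assumptions of this sketch.

```python
def split_segments(seq: list[tuple[str, int]],
                   max_gap: int) -> list[list[tuple[str, int]]]:
    """Start a new segment whenever two adjacent tokens are more than
    max_gap positions apart."""
    segments: list[list[tuple[str, int]]] = []
    for tok, pos in seq:
        if segments and pos - segments[-1][-1][1] <= max_gap:
            segments[-1].append((tok, pos))
        else:
            segments.append([(tok, pos)])
    return segments


def representative_sequences(seq: list[tuple[str, int]],
                             max_gap: int,
                             k: int = 2) -> list[list[tuple[str, int]]]:
    """Return the k longest segments; sorted() is stable, so segments of
    equal length keep their document order."""
    return sorted(split_segments(seq, max_gap), key=len, reverse=True)[:k]
```

Applied to the document AXXBABXXXBAXXXBABABBXXXBA with the query tokens A and B and a threshold of 3, this yields four segments, of which the ones starting at positions 15 and 1 are the two longest.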

The following example illustrates the generation of representative token
sequences for Chinese text.

The text document is shown as:

Doc #134
[Chinese document text not recoverable]

The query is input as a string of three Chinese characters [not recoverable; denoted X, Y,
and Z here], wherein the query is transformed into the query
token sequence by a Tokenizer as "X_1 Y_2 Z_3" while the indices of the relevant document
tokens are shown as below:

Extended Inverted Lists:

X, (Doc#134, (1, 41, 54, 65, 81)), (Doc#135, ...)
Y, (Doc#134, (45)), (Doc#135, ...)
Z, (Doc#134, (47)), (Doc#135, ...)

Reconstruction of the document token sequences (on the basis that the query
token sequence is X_1 Y_2 Z_3):

Doc#134: X_1 X_41 Y_45 Z_47 X_54 X_65 X_81

Doc#135: ...

With a given threshold (a predetermined positioning value) of 3, the document
token sequence "X_1 X_41 Y_45 Z_47 X_54 X_65 X_81" of Doc#134 is divided into five
segments, which are "X_1," "X_41 Y_45 Z_47," "X_54," "X_65," and "X_81." Accordingly,
the two longest segments of the document token sequence, "X_1" and "X_41 Y_45 Z_47,"
are selected in this example as representative token sequences for determining the
similarity between the query token sequence and the document token
sequence.

According to the preferred embodiment, the step (3) further comprises the
following steps, wherein $D = (d_{i_1}, d_{i_2}, \ldots, d_{i_m})$ (of m tokens) and
$Q = (q_1, q_2, \ldots, q_n)$ (of n tokens) respectively denote the representative token sequence
and the query token sequence under similarity measurement.

(3.1) Determine a token appearance (TA) score by measuring a token
appearance of the representative token sequence with respect to the query token
sequence.

(3.2) Determine a token order (TO) score by measuring a token order of the
representative token sequence with respect to the query token sequence.

(3.3) Determine a token consecutiveness (TC) score by measuring a token
consecutiveness of the representative token sequence with respect to the query token
sequence.

The step (3.1) comprises the following sub-steps. 

(3.1.1) Consult an index of said text documents to determine the weight of 
each token in the query token sequence. 

(3.1.2) Calculate a sum of the weights of the query tokens that appear in
the representative token sequence.

(3.1.3) Output a token appearance score of the token appearance by
calculating the fraction of the sum divided by the total weight of all query tokens.

As mentioned above, the weight of a query token is measured by (idf + 1).
Accordingly, the following equation illustrates the determination of the token appearance
TA.

Token Appearance (TA):

$$TA(D,Q) = \frac{\sum_{j=1}^{n} t(q_j)\, w(q_j)}{\sum_{j=1}^{n} w(q_j)}$$

wherein $w(q_j)$ represents the weight of the "jth" query token.

Accordingly, $t(q_j) = 1$ if the "jth" query token is shown in the representative
token sequence and $t(q_j) = 0$ if the "jth" query token is not shown in the representative
token sequence.
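
A minimal sketch of the token appearance computation, assuming the weights are supplied as a dictionary keyed by token, is given below. The function name and argument layout are illustrative.

```python
def token_appearance(rep_tokens: list[str],
                     query_tokens: list[str],
                     weights: dict[str, float]) -> float:
    """Weight of the query tokens found in the representative sequence,
    divided by the total weight of all query tokens."""
    present = set(rep_tokens)
    total = sum(weights[q] for q in query_tokens)
    matched = sum(weights[q] for q in query_tokens if q in present)
    return matched / total
```

The score is 1 when every query token appears in the representative token sequence and falls toward 0 as heavily weighted query tokens go missing.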

The object of the token order (TO) measurement is to capture the
word/character ordering, wherein the step (3.2) comprises the following sub-steps.

(3.2.1) Determine a length of the longest common subsequence of the 
representative token sequence and the query token sequence; 

(3.2.2) Determine a length of the representative token sequence; 

(3.2.3) Determine a length of the query token sequence; and 

(3.2.4) Output the token order score of said token order by calculating a
fraction of the length of the longest common subsequence divided by the average of
the length of the representative token sequence and the length of the query token
sequence.

Accordingly, the equation for the token order TO is:

Token Ordering (TO):

$$TO(D,Q) = \frac{|LCS(D,Q)|}{(|D| + |Q|) \div 2}$$

where LCS(D,Q) is the longest common subsequence of D and Q and |S| denotes
the length of sequence S.
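
The token order score can be sketched with the standard dynamic-programming computation of the longest common subsequence length. The function names are illustrative, and the row-by-row formulation is one of several equivalent ways to compute the LCS length.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def token_order(rep_tokens: list[str], query_tokens: list[str]) -> float:
    """|LCS(D, Q)| divided by the average of |D| and |Q|."""
    average_length = (len(rep_tokens) + len(query_tokens)) / 2
    return lcs_length(rep_tokens, query_tokens) / average_length
```

For a representative sequence B A against the query A B, the longest common subsequence has length 1, so the score is 1 / 2 = 0.5; identical sequences score 1.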

The object of the token consecutiveness (TC) measurement is to capture the 
distribution of the query tokens, wherein the step (3.3) further comprises the following 
sub-steps. 



(3.3.1) For each pair of adjacent document tokens in the representative token
sequence, determine a relative distance between the positional difference of the pair in
the document and the positional difference of the same tokens in the query token sequence.

(3.3.2) Output the token consecutiveness score of the token
consecutiveness by calculating a fraction of a sum of the inverses of the relative distances
divided by the number of pairs of adjacent tokens, which equals the length of the
representative token sequence less one.

Token Consecutiveness (TC):

$$TC(D,Q) = \frac{1}{m-1} \sum_{j=1}^{m-1} \frac{1}{rd_j}$$

where $rd_j = |(i_{j+1} - i_j) - (pos(d_{i_{j+1}}, Q) - pos(d_{i_j}, Q))| + 1$, $i_j$ is the document
position of the "jth" token of D, and $pos(t, Q)$ gives the position
of t in Q. When there is more than one possible value for $pos(d_{i_{j+1}}, Q)$ or $pos(d_{i_j}, Q)$, the
values may be chosen such that $|(i_{j+1} - i_j) - (pos(d_{i_{j+1}}, Q) - pos(d_{i_j}, Q))|$ is as small as possible.
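
A minimal sketch of the token consecutiveness computation follows, assuming the representative token sequence is given as (token, document position) pairs and query positions are counted from 1. Returning 0.0 for a sequence with fewer than two tokens is an assumption of this sketch, since the fraction is undefined when there are no adjacent pairs.

```python
def token_consecutiveness(rep: list[tuple[str, int]],
                          query_tokens: list[str]) -> float:
    """Average of 1 / rd_j over the adjacent token pairs of rep."""
    if len(rep) < 2:
        return 0.0  # assumption: no adjacent pairs to measure
    # A token may occur at several query positions; remember all of them.
    qpos: dict[str, list[int]] = {}
    for k, q in enumerate(query_tokens, start=1):
        qpos.setdefault(q, []).append(k)
    total = 0.0
    for (t1, p1), (t2, p2) in zip(rep, rep[1:]):
        # Choose the query positions that make the difference as small as possible.
        rd = 1 + min(abs((p2 - p1) - (b - a))
                     for a in qpos[t1] for b in qpos[t2])
        total += 1 / rd
    return total / (len(rep) - 1)
```

Adjacent tokens that are spaced in the document exactly as in the query give a relative distance of 1 and therefore contribute the maximum value of 1 to the sum.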



The above three measures all have a score ranging from 0 to 1. A linear
combination (weighted sum) of the measures (which also ranges from 0 to 1) can be
calculated from $\alpha_1 TA(D,Q) + \alpha_2 TO(D,Q) + \alpha_3 TC(D,Q)$ with a suitable selection of $\alpha_1$, $\alpha_2$,
and $\alpha_3$ such that $\alpha_1 + \alpha_2 + \alpha_3 = 1$. An implementation may allow the user to select the
coefficients.



Therefore, the similarity of the query token sequence is calculated by summing
the token appearance score, the token order score, and the token consecutiveness score.

The result shown below illustrates the determination of the similarity between
the representative token sequence and the query token sequence.

Following the earlier example, we consider measuring the similarity between
the representative token sequence "X_41 Y_45 Z_47" and the query token sequence "X_1 Y_2 Z_3,"
where X, Y, and Z stand in for the three Chinese characters of the query.

Token appearance TA of the query token sequence:

TA = (1*(1/3) + 1*(1/3) + 1*(1/3)) / (1/3 + 1/3 + 1/3) = 1

Token order TO of the query token sequence: TO = 3/((3+3)/2) = 1

Token consecutiveness TC of the query token sequence: rd_1 = 1 + |(45-41) - (2-1)| = 4;
rd_2 = 1 + |(47-45) - (3-2)| = 2; TC = ((1/4) + (1/2))/2 = 0.375

The similarity, with each coefficient set to 1/3: 1*1/3 + 1*1/3 + 0.375*1/3 = 0.792
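
The weighted sum and the rule for a document with two representative token sequences might be sketched as follows. Equal coefficients of 1/3 are assumed as the default, and the function names are illustrative.

```python
def similarity(ta: float, to: float, tc: float,
               alphas: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three scores; the coefficients should sum to 1."""
    a1, a2, a3 = alphas
    return a1 * ta + a2 * to + a3 * tc


def document_score(rep_scores: list[float]) -> float:
    """A document is scored by its best representative token sequence."""
    return max(rep_scores)
```

With TA = 1, TO = 1, and TC = 0.375 and equal coefficients, similarity returns approximately 0.792, in line with the figures of the example.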

The following experimental results illustrate the accuracy of the search result by
using the present invention in comparison with the bigram method.

Experiment 1 illustrates the query including a person name and the prefix
thereof.

Query: [Chinese text not recoverable], comprising the name of a person and a
prefix of the person.

Experiment 2 illustrates the query including two person names and a connecting
word therebetween.

Query: [Chinese text not recoverable], comprising the names of two persons and
the connecting word between them.

Experiment 3 illustrates the query including the abbreviation of a noun phrase.

Therefore, the approximate matching and fault-tolerant capabilities are
substantially enhanced so as to efficiently and precisely retrieve text documents with
respect to the query submitted by the user.

One skilled in the art will understand that the embodiment of the present
invention as shown in the drawings and described above is exemplary only and not
intended to be limiting.

It will thus be seen that the objects of the present invention have been fully and
effectively accomplished. Its embodiments have been shown and described for the
purposes of illustrating the functional and structural principles of the present invention
and are subject to change without departure from such principles. Therefore, this invention
includes all modifications encompassed within the spirit and scope of the following
claims.

What is Claimed is:



1. A sequence based indexing and retrieval method for text documents, 
comprising the steps of: 

(a) generating a query token sequence, having at least a query token, from a 
query submitted by a user; 

(b) generating at least a representative token sequence, having at least a 
document token, from each of said text documents that contain at least one token of said 
query token sequence; 

(c) measuring a similarity between each of said representative token 
sequences and said query token sequence by: 

(c[[.]]-1) determining a token appearance score by measuring a token
appearance of said representative token sequence with respect to said query token
sequence;

(c[[.]]-2) determining a token order score by measuring a token order of said
representative token sequence with respect to said query token sequence; and

(c[[.]]-3) determining a token consecutiveness score by measuring a token
consecutiveness of said representative token sequence with respect to said query token
sequence; and

(d) retrieving said text documents in response to said similarity of said
representative token sequence with respect to said query token sequence with a ranking 
order in accordance with said token appearance score, said token order score, and said 
token consecutiveness score, provided that for a document with two representative token 
sequences, its similarity is determined by the representative token sequence with a higher 
score. 

2. The method, as recited in claim 1, wherein the step (c[[.]]-1) comprises the
sub-steps of:

(c[[.]]-1[[.]]-1) consulting an index of said text documents to determine the
weight of each token in said query token sequence;

(c[[.]]-1[[.]]-2) calculating a sum of the weights of the query tokens that
appear in said representative token sequence; and

(c[[.]]-1[[.]]-3) outputting said token appearance score of said token
appearance by calculating a fraction of said sum divided by the total weight of all query
tokens.

3. The method, as recited in claim 2, wherein said weight of said query token 
in said query token sequence is measured by determining a token frequency of said query 
token in said text documents. 

4. The method, as recited in claim 1, wherein the step (c[[.]]-2) comprises the
sub-steps of:

(c[[.]]-2[[.]]-1) determining a length of the longest common subsequence
of said representative token sequence and said query token sequence;

(c[[.]]-2[[.]]-2) determining a length of said representative token sequence;

(c[[.]]-2[[.]]-3) determining a length of said query token sequence; and

(c[[.]]-2[[.]]-4) outputting said token order score of said token order by
calculating a fraction of said length of said longest common subsequence divided by an
average sum of said length of said representative token sequence and said length of said
query token sequence.

5. The method, as recited in claim 3, wherein the step (c[[.]]-2) comprises the
sub-steps of:

(c[[.]]-2[[.]]-1) determining a length of the longest common subsequence
of said representative token sequence and said query token sequence;

(c[[.]]-2[[.]]-2) determining a length of said representative token sequence;

(c[[.]]-2[[.]]-3) determining a length of said query token sequence; and

(c[[.]]-2[[.]]-4) outputting said token order score of said token order by
calculating a fraction of said length of said longest common subsequence divided by an
average sum of said length of said representative token sequence and said length of said
query token sequence.

6. The method, as recited in claim 1, wherein the step (c[[.]]-3) comprises
the sub-steps of:

(c[[.]]-3[[.]]-1) determining a relative distance between a positional
differentiation of each adjacent document tokens and a positional differentiation of said
adjacent document tokens in said query token sequence; and

(c[[.]]-3[[.]]-2) outputting said token consecutiveness score of said token
consecutiveness by calculating a fraction of a sum of the inverses of said relative
distances divided by the number of pairs of adjacent tokens, which equals the length of
said representative token sequence less one.

7. The method, as recited in claim 3, wherein the step (c[[.]]-3) comprises the
sub-steps of:

(c[[.]]-3[[.]]-1) determining a relative distance between a positional
differentiation of each adjacent document tokens and a positional differentiation of said
adjacent document tokens in said query token sequence; and

(c[[.]]-3[[.]]-2) outputting said token consecutiveness score of said token
consecutiveness by calculating a fraction of a sum of the inverses of said relative
distances divided by the number of pairs of adjacent tokens, which equals the length of
said representative token sequence less one.

8. The method, as recited in claim 5, wherein the step (c[[.]]-3) comprises the
sub-steps of:

(c[[.]]-3[[.]]-1) determining a relative distance between a positional
differentiation of each adjacent document tokens and a positional differentiation of said
adjacent document tokens in said query token sequence; and

(c[[.]]-3[[.]]-2) outputting said token consecutiveness score of said token
consecutiveness by calculating a sum of the inverses of said relative distances with
respect to said representative token sequence.

9. The method, as recited in claim 8, wherein said similarity of said
representative token sequence is calculated with respect to said query token sequence by
summing said token appearance score, said token order score, and said token
consecutiveness score, wherein said ranking order of said text documents is determined
by a weighted sum of said token appearance score, said token order score, and said token
consecutiveness score of each of said representative token sequences of said text
documents.

10. The method as recited in claim 1, in step (b), further comprising a step of
selecting at least a candidate document from said text documents, wherein one of said
text documents is selected to be said candidate document when said text document
contains at least one token of said query token sequence.

11. The method as recited in claim 9, in step (b), further comprising a step of
selecting at least a candidate document from said text documents, wherein one of said
text documents is selected to be said candidate document when said text document
contains at least one token of said query token sequence.

12. The method as recited in claim 10, in step (b), further comprising a step of
consulting an index of said text documents to establish said candidate document, wherein
tokens that also appear in the query token sequence are collected to form a document
token sequence for each document and the two longest segments of said document token
sequence are selected as representative token sequences wherein the positional
differentiation of each adjacent document tokens is no larger than a predetermined
positioning value while said corresponding text document is selected as the said
candidate document.

13. The method as recited in claim 11, in step (b), further comprising a step of
consulting an index of said text documents to establish said candidate document, wherein
tokens that also appear in the query token sequence are collected to form a document
token sequence for each document and the two longest segments of said document token
sequence are selected as representative token sequences wherein the positional
differentiation of each adjacent document tokens is no larger than a predetermined
positioning value while said corresponding text document is selected as the said
candidate document.

14. The method as recited in claim 10, in step (b), further comprising a step of
retaining said candidate document to be used for measuring said similarity with respect to
said query token sequence, wherein the said candidate document is retained when said
candidate document contains a token that has a weight no less than a predetermined
fraction of the total weight of query tokens.

15. The method as recited in claim 11, in step (b), further comprising a step of
retaining said candidate document to be used for measuring said similarity with respect to
said query token sequence, wherein the said candidate document is retained when said
candidate document contains a token that has a weight no less than a predetermined
fraction of the total weight of query tokens.

16. The method as recited in claim 13, in step (b), further comprising a step of
retaining said candidate document to be used for measuring said similarity with respect to
said query token sequence, wherein the said candidate document is retained when said
candidate document contains a token that has a weight no less than a predetermined
fraction of the total weight of query tokens.

17. The method, as recited in claim 1, wherein said text document contains
Chinese characters, English words, numbers, punctuations, and symbols as said document
tokens.

18. The method, as recited in claim 9, wherein said text document contains
Chinese characters, English words, numbers, punctuations, and symbols as said document
tokens.

19. The method, as recited in claim 13, wherein said text document contains
Chinese characters, English words, numbers, punctuations, and symbols as said document
tokens.

20. The method, as recited in claim 16, wherein said text document contains
Chinese characters, English words, numbers, punctuations, and symbols as said document
tokens.

Sequence Based Indexing and Retrieval Method for Text Documents



Abstract of the Disclosure 

A sequence based indexing and retrieval method for a collection of text
documents includes the steps of generating a query token sequence from a query;
generating at least a representative token sequence from each of the documents that
contain at least one token of the query token sequence; measuring a similarity between
each of the representative token sequences and the query token sequence; and retrieving
the text document in response to the similarity of the representative token
sequence with respect to the query token sequence. The similarity measurement is
performed by determining a token appearance score, a token order score, and a token
consecutiveness score of the representative token sequence with respect to the query
token sequence, so as to illustrate the similarity between the representative token
sequence and the query token sequence for precisely and effectively retrieving the text
document.